IMPROVED INFORMATION EXTRACTION FOR TEXT MINING

Academic Open Internet Journal

ISSN 1311-4360

www.acadjournal.com

Volume 22, 2008

IMPROVED INFORMATION EXTRACTION FOR TEXT MINING

WITH SOFT MATCHING RULES AND EXCEPTIONS

A. CHRISTY

RESEARCH SCHOLAR

SATHYABAMA UNIVERSITY

E-mail : Christy_a1@hotmail.com

P.THAMBIDURAI

PROFESSOR

PONDICHERRY ENGINEERING COLLEGE

E-mail : ptdurai58@yahoo.com

ABSTRACT

The popularity of the Internet and the large number of documents available in electronic form in modern world has motivated the search for hidden knowledge in text collections. As the Internet presents numerous sources of useful information, accessing and extracting their content is difficult. Information extraction (IE) software identifies and removes relevant information from texts, pulling information from a variety of sources, and aggregates it to create a single view. Information extraction can be of two types: natural language processing and wrapper induction. In this paper, we propose an algorithm for Information Extraction using NLP technique in the form of scenario template production. Extensive linguistic knowledge is not necessary for successful IE. The system is constructed using inductive learning with soft matching rules combined with exceptions for extracting rules. The system is tested with a collection of documents from NSF Research abstracts and abstracts from two different domains of www.computer.org, we have found the system has improved its recall value after the application of exceptions.

KEYWORDS : Parsing, Feature set Extraction, Rule Extraction, Precision, Recall, etc.

INTRODUCTION

The purpose of text mining is to research technologies to discover useful knowledge from enormous collections of documents, and to develop a system to provide this knowledge and to support in decision making. IE translates content into a homogeneous form through technologies like XML (eXtensible Mark-up Language). The Target of opportunity for IE is to provide texts whose function is the communication of factual information. IE is not a stand-alone task that human analysts typically engage in. The goal of IE software is to transform texts composed of everyday language into a structured, database format. In this way, heterogeneous documents are summarized and presented in a uniform manner. To improve accuracy and ease development, IE software is usually domain or topic specific.

The two basic approaches to the design of IE are Knowledge Engineering Approach and Automatic Training Approach. In Knowledge Engineering approach, Grammars are constructed by hand, Patterns are discovered by domain experts and also the approach requires much laborious tuning and hill climbing, whereas the Automatic Training Approach learns rules from annotated corpora, by interaction with the user and uses statistical methods for discovering patterns whenever possible. A knowledge engineering system is used when linguistic resources like lexicons are available, a skilled rule writer available, training data is sparse, or expensive to obtain, it is critical to obtain the last small increment of performance, and the extraction specifications are likely to change slightly over time. Automatically trained systems are best deployed in complementary situations, where resources other than raw text are unavailable, training data can be easily and cheaply obtained, task specifications are stable, and absolute maximum performance is not critical.

An IE system designed to monitor technical articles about Information Science, for example, could pull out the names of professors, research studies, and topics of interest, conferences, and forthcoming publications from press releases, news stories, or emails and encode this information in a database. End-users can then search across this database by textual attribute or feature.

IE can be further facilitated using bag of words, Natural language Processing (NLP) techniques, Conditional Random field, Machine learning, patterns in the form of templates, etc. are some of the methods used for transforming the documents into a collection of concepts, in the structured form. In this paper, we propose a rule induction method for automatically extracting the key information like aim, methodology and conclusion specified by authors in technical abstracts and also from NSF Research Abstracts in which the information is extracted in the form of Named Entity and scenario template production form. When discovering patterns in text, strict matching of strings is inadequate because textual database entries generally exhibit variations due to its usage. In order to find the matching of strings, we have use the soft-matching rules in the inductive method with exceptions that integrates rule-based and automatic training approach.

The goal of an IE system is to find specific data in natural-language texts. The data to be extracted is typically given by a template which specifies a list of slots to be filled with substrings taken from the document. Linguistic evidence shows that an abstract in a given domain follows a prototypical and even modular organization. From a scientific viewpoint, there are also claims that important findings could be searched by linking this kind of information across the documents [1]. It is observed that authors use a definite set of verbs for stating the predicates. For e.g., ‘propose’ is a verb which is always used by the authors to indicate the aim and ‘show’ is a verb which is always used to state the conclusion. IE task extracts information at a rhetorical and semantic level, and uses the information in a rule-like form to represent the document’s key facts. Computationally, rule + exception strategies focus on summarizing and understanding data in concise forms. Rule accuracy isn’t necessarily a preferred criterion for rule generation, but the system’s overall accuracy can be improved by adding exceptions [4].

MOTIVATION AND BACKGROUND

Previous studies have shown that bag of words, natural language processing techniques and patterns in the form of templates as some of the techniques adopted in IE [1] [2] [3] [6] . Some of the other systems existing for IE is extracting information on interacting proteins from biomedical text using manually developed patterns( Suiseki system , 2002), extracting the names of organizations and their headquarters by generating patterns and extracting tuples from plain-text documents (Snowball system ), a genre-based extraction patterns using natural language processing techniques for extracting the rhetoric information contained in technical abstracts (John Atkinson-Abutridy, et. al 2004) , extracting a database from postings to the USENET newsgroup, Austin.jobs, etc. using predefined templates (Raymond J. Mooney et.al 2005), etc. Huang and Zweig & Padmanabhan (IBM Yorktown) in “Information Extraction from Voicemail” has retrieved the identity and contact numbers of callers, Hsin-Hsi Chen, et.al in “Description of a Topic Detection Algorithm on TDT3 Mandarin Text”, has used the similarity metrics for finding the threshold based on which the topics are detected either relevant or irrelevant[7]. IE has been shown to be useful in a variety of other applications like seminar announcements, restaurant guides, university web pages, apartment rental ads, news articles, etc.

Formally evaluating NLP information extraction software requires compiling a corpus of texts and manually creating an answer key. Texts are run through an IE software system and a template of answers is produced. This template is then measured against an answer key that specifies what information should be extracted from a text. Each software system fills empty template slots with appropriate values derived from the test documents.

The motivation for IE is that it is representative of a general challenge for learning. Given a set of concepts to be recognized and labeled, how can an algorithm learn to reliably spot and label new examples of the concepts ? IF-THEN rules are one of the most expressive representations for learned hypotheses. The rules can be used to categorize previously unseen data. In general, classification rule learning methods extract rules from decision trees, adopt sequential set covering algorithms, or translate neural nets into human-readable rules. Most of them assume that examples are represented as feature vectors, the components of which are either real numbers or nominal values [3]. Two widely-used schemes for rule-learning are C4.5 rules and Ripper. Both of them generate concise and human-readable outputs, rule sets. C4.5 rules induces rules from binary data by learning decision trees and translating them into pruned rules. The algorithm generates a set of rules for each path from a tree learned by C4.5. It then checks if the rules can be generalized by dropping conditions. C4.5 can handle training data with continuous attributes or missing attribute values and has been successfully applied to a wide variety of machine-learning tasks. Ripper is a fast rule learner with an ability to handle set-valued features. Ripper, based on the incremental reduced error pruning algorithm, splits the available training data into a growing set and pruning set before learning rules. It is shown that Ripper scales nearly linearly with the number of examples in the training set.

ARCHITECTURE OF IE SYSTEMS

The general architecture of an IE system comprises the components as depicted in Fig.1. The process starts with the splitting of a given sentence into various tokens (words), from which the stop words, such as the, a, an, it, etc. are removed, as they contribute no meaning for recognition of key terms used for IE. The Morphological and lexical processing concerns how words are constructed from more basic meaning units called morphemes. A morpheme is the primitive unit of meaning in a language (for e.g., the meaning of the word proposed is derivable from the meaning of the verb propose (the stem word) and the inclusion of suffixes may transforms a verb into adverb. The morphological processing deals with the identification of stem word, which is a verb. Syntactic Analysis concerns how words can be put together to form correct sentences and determines what structural role each word plays in the sentence and what phrases are subparts of what other phrases. Domain analysis includes the general knowledge about the structure of the world that language users must have in order to, for example, maintain efficient knowledge discovery

SIMILARITY METRICS

Most of the existing text mining techniques discover rules requiring an exact match. However, due to the heterogeneity problem, a form of soft-matching is needed to construct an effective text mining system. Soft-matching requires a method to determine the distance between two textual items or documents. Similarity of text can be measured using standard bag of words (BOW) or edit-distance measures. The TFIDF (Term Frequency, Inverse Document Frequency) weighting scheme is used to assign higher weights to distinguished terms in a document. TFIDF makes two assumptions about the importance of a term. First, the more a term appears in the document, the more important it is (term frequency). Second, the more it appears through the entire collection of documents, the less important it is since it does not characterize the particular document well (inverse document frequency). In the TFIDF framework, the weight for term tj in a document di, wij is defined as follows:

w_ij = tf_ij* log₂ N/n

where tfij is the frequency of term tj in document di, N is the total number of documents in a collection, and n is the number of documents in which term tj occurs at least once.

INFORMATION EXTRACTION SYSTEM

In order to extract this initial key information from the texts, an IE module was built. Essentially, it takes a set of text documents, has them tagged through a previously trained part-of-speech (POS) tagger, and produces an intermediate representation for every document (i.e., template, in an IE sense), which is then converted into a general rule. A set of hand-crafted domain-independent extraction patterns as shown in fig.2 were written and coded. For this purpose, each pattern constructs an output representation which involves linguistic knowledge: the rhetorical role (aim, method, conclusion) and the action represented by the predicate relation and its arguments (partial sentences) . From the Fig.2, present, introduce and show identified as the verbs are called as predicates and the sequence of token describing the sentence is called as the arguments

( ‘Document 1’, [% Preceding information

aim (present ( ‘data mining, visualization ,

techniques……..’ ) ),

method (introduce ( ‘powerful, visual, data mining ,

environment..’ )) ],

{% Succeeding information

conclusion ( show ( ‘learn, more, more, quickly, such,

integrated...’ ))

}

)

Fig.2

FEATURE EXTRACTION

Feature extraction play a vital role in information extraction, since the template is filled only by the occurrence of the specified feature. We have used Bagging, which works by training each classifier on a random sample from the training set. Bagging has the important advantage that, it is effective on “unstable learning algorithms” (Breiman 1996), where small variations in the input data can cause large variations in the learned theories. A second advantage of bagging algorithm is highly parallel and thus makes the pattern matching effective. The model parses the sentences using trigram model and lexical analysis is carried out . If the current token is a verb and the previous token is any one among Noun, Pronoun or prep and if the next token is any one among Det,Pro, Prep,Noun or Adj then the verb is identified as the predicate to be retrieved.

If Previous token is in the list (Noun, Pronoun, Prep) and if the next token is one among the token in the list (Det, Pro, Prep, Noun, Adj) , Using Soft-matching rules, both the sets can be related together. The Exception included in the model is “if the current token is “is (be), ” retrieve the next verb. If the next token is among one among the set “Conj, Adv” do not consider for generating rules. Certain nouns like “aim”, “objective” can also be used for extracting predicates. Another testing data used for extracting information is collected from NSF Research Award abstracts. In this dataset, the nouns line “aim”, “objective” plays a major role in filling the patterns. For the tokens those are extracted, calculate the term frequency using the inverse document frequency formula, (log (N) /df), where N is the total number of documents and df is the corresponding document frequency. The inverse document frequency values for each and every tokens are found and are arranged in their descending order inorder to search through the entire documents. If the calculated values lies between 0 to 1 those are considered as the valid tokens. If the value is very close to o, it satisfies the condition and extracts the exact rule, whereas if the value exceeds 1, we retrieve the next verb as the token. If any token fails to fall between 0 to 1, the entire rule is discarded.

During semantic analysis, using the Hidden Markov model, the system verifies whether the particular verb indicates the predicates and the validations done on two different domains of documents. The abstracts are taken from two different domains related to information retrieval and image processing from www.computer.org from the verbs used for stating the aim and conclusion are same in different domains and for stating the methodology , the verbs (tokens) are not common. Although there are no explicit causal relations in the above example, we can hypothesize a simple rule of the form.

IF the current goal is G1 ,... and the method used is M1,... THEN

it is true that we can achieve the conclusion C1,...

Data mining methods generally require terms in discovered rules to exactly match database entries. Normal variation in textual data items can therefore prevent the discovery of important regularities. Variations can arise from typographical errors, misspellings, abbreviations, as well as other sources. Variations are particularly pronounced in data that is automatically extracted from unstructured or semi-structured documents or web pages. An example for a sample filled template, from where the data is extracted is shown in fig.3)

Fig.3

The soft-matching rules allow non-standardized database entries to match antecedents and consequents based on relevant similarity metrics. A benefit of association-rule mining instead of classification-rule induction is that consequents of rules are not predetermined, resulting in efficient mining of all potential associations as part of a single process. When inducing classification rules, a separate learning step must

be run for predicting each possible item. The algorithm specification adopted for rule rule mining and information extraction are depicted in fig. 4 and fig.5

Input : D is the set of documents

Output : R is the set of rules

Function Rulemining(D)

Fig.4 Algorithm specification : Rule mining

Input : R is the set of prediction rulew

Output : P is the set of predicates extracted

Function InformationExtraction(R, D)

Fig.5 Algorithm specification : Information extraction

We mine rules for identifying the occurrence of the current token preceded and followed by the proper order specified and finding the threshold (inverse document frequency , between 0 and 1). If the rules satisfies the condition, they are added to the rule, else it is pruned. For each rule extracted, see whether the training set of data matches the current token, if it matches the rules are extracted and stored in the structured format. Out of 500 documents extracted from www.computer.org, we have extracted 120 rules and from 750 documents collected from NSF Research Abstracts, 200 rules are finally extracted. The accuracy of the rules extracted are found and the results are tabulated as shown in fig. 6.